Statistical computation and visualization (MATH-517)
Transformations to talk :
Here are some packages used in our project. Make sure they are installed before running the R Markdown file. Since the format of this Markdown is not standard, you need to install the Distill package. To avoid errors, install it via Github (the package is updated there) with the following commands:
install.packages("devtools")
devtools::install_github("rstudio/distill")
Wildfires are uncontrolled fires of fuel materials composed of mostly natural vegetation, such as woods or shrublands. They represent an environmental threat with major impacts all over the world, and their frequency of outbreak and severity are expected to increase with global warming Jones et al. 2020. Each year, wildfires cause many direct fatalities in humans, as well as extreme air pollution events and loss of both biodiversity and other valuable ecosystem services. They also contribute a substantial fraction of global greenhouse gas emissions each year, so they can further worsen climate change. The modeling of wildfires has been approached using a wide variety of statistical approaches (see, e.g., Preisler et al. 2004, Pereira & Turkman 2019, Xi et al. 2019), some of which employ extreme value theory (see, e.g., Pereira & Turkman 2019).
As for the origin of wildfire starts, lightning is the main natural cause, but in the overwhelming majority of cases, human activities are responsible. They can be intentional (arson, children) or accidental (burning debris, agricultural activities, campfires, smoking). In general, wildfires are the result of a combination of the presence of combustible materials (e.g., a forest), their easy flammability (e.g., as a result of extreme weather conditions such as droughts), and a triggering event (e.g., lightning or human activity). In most fire-prone regions of the world, wildland fire activity has seasonal cycles due to the seasonality of favorable weather conditions.
To assist in wildland fire management, it is essential to understand and predict the risk factors that contribute to wildland fires and their spatial and seasonal distribution. Wildland fire management involves a multitude of tasks, including monitoring forest ecosystems, deploying preventative measures, firefighting logistics, and short-term forecasting and long-term projections of wildland fire activity.
With the chosen data set, we can concentrate on two important components of wildland fire activeness: fire occurrence and fire magnitude. Given a region of space and time, we consider the number of spatially separated wildfire events as an observation with respect to the first aspect (occurrence), and the aggregate burned area of wildfires originating from the area of interest as an observation with respect to the second aspect (size).
The dataset contains 563,983 rows for 37 columns. The columns are the following:
Please note that the area proportions lc1 to lc18 do not always sum to exactly 1 for each pixel and month since a few classes with quasi-0 proportion have been removed.
Since the original data was given under the context of a prediction competition with the university of Edimburgh, there is a 8,000 of missing values in each of the columns CNT and BA. The missing values are not located necessarly in the same lines for the two features.
We want to determine if there is a significant correlation between the different land covers and the appearance of fires. In order to do so, we use a database filtered of missing data. As can be seen in Figure (?), the correlation between the fires remains very low. To confirm this statement, we plot in Figure(?) the variables that have been defined as highly correlated in red and the rest in light blue. We can then see that there is no strong correlation between the columns nor between the columns and the apparition of the fires. We define a correlation as strong when its absolute value exceeds 0.8.
In this section, we will study the distribution of land covers over time. To do so, we have selected the geographical coordinates of Malibu, California. This area has been heavily impacted by wildfires in recent years. In November 2018, the wealthy coastal enclave of Malibu was engulfed by the Woolsey Fire, which spread to over 96,000 acres of land outside of LA and is now 35% contained. At least two people were pronounced dead in Malibu on Friday. Their burned bodies were found in a car near Mulholland Highway. In Figure (), we can see the exact location of the studied point on the map of the United States.
In Figure (), we can see the distribution of the landcovers with time. We can see that the proportions are relatively stable and consistent with time. The predominant cover is the urban area with 31% of the surface in 1993 this proportion only increases with time.
Although in this particular example of Malibu, the proportion of different landcovers seems consistent, this is not always the case. Taking the coordinates in Figure () in Wisconsin, we can see that the proportions change drastically depending on the year. The percentage of land representing tree broadleaved evergreen closed to open increases with time until it becomes the predominant one, while shrub or herbaceous cover flooded decreases drastically with time as seen in Figure(). We will not analyze the reasons and parameters that may have motivated this change, but we will try to quantify this change in the next section.
To access the intarctive part, click this link. In this app you will able to select the longitude and lattitude wanted and see the distribution of the landcovers.
We will first proceed by representing the territory according to the predominant type of coverage. Figure () shows this arrangement in the year 2000. We observe that water dominates the maritime borders, tree broadleaved are predominant in the east of the country while the west is dominated by tree needleleave. We can easily identify the cities of New York, Los angeles and other large cities by their predominance of urban space.
To access the intarctive part, click this link. In this app you will able to select the year and the predominant landcover will display directly on the us map.
To analyze the variation, we create a vector whose values start at zero for any given location and increases by one as soon as the predominant coverage area changes from one month to the next. What is first interesting to note is that all locations change their predominant type of space at least once. Indeed in Figure (), we see that the shift values start at 2. We also see that the shifts do not exceed 10, knowing that this measurement is taken over 22 years, this is still relatively low.
We want to understand the correlation between land cover and meteorological variables. To do so, we are going to clean our data first, rename the variables by their description so that we can easily analyze our results and then compute the correlation between the variables. The problem is that when trying to output the matrix, we get a warning saying that the matrix is too big. To get an idea of what variable can be highly related we are going to output the correlation between a pair of land cover and meteorological variable when it is bigger than a threshold manually set.
[1] "Important correlations for the land cover variable number 1: 0"
[1] "Important correlations for the land cover variable number 2: 0"
[1] "Important correlations for the land cover variable number 3: 0"
[1] "Important correlations for the land cover variable number 4: 0"
[1] "Important correlations for the land cover variable number 5: 0"
[1] "Important correlations for the land cover variable number 6: 0"
[1] "Important correlations for the land cover variable number 7: 0"
[1] "Important correlations for the land cover variable number 8: 0"
[1] "Important correlations for the land cover variable number 9: 0"
[1] "Important correlations for the land cover variable number 10: 0"
[1] "Important correlations for the land cover variable number 11: 0"
[2] "Important correlations for the land cover variable number 11: 7"
[1] "Important correlations for the land cover variable number 12: 0"
[1] "Important correlations for the land cover variable number 13: 0"
[1] "Important correlations for the land cover variable number 14: 0"
[1] "Important correlations for the land cover variable number 15: 0"
[1] "Important correlations for the land cover variable number 16: 0"
[1] "Important correlations for the land cover variable number 17: 0"
[1] "Important correlations for the land cover variable number 18: 0"
Here, we choose a threshold of 0.6 and none of the correlations has been outputted. This means that all of them are less that 0.6. In fact, when printing all the correlations (threshold = 0) we remark that the values are very small. This can come from the fact that there are no correlations between the variables or that the correlation is not linear here. So we are going to compute the correlation with spearman method.
[1] "Important correlations for the land cover variable number 1: 0"
[1] "Important correlations for the land cover variable number 2: 0"
[1] "Important correlations for the land cover variable number 3: 0"
[1] "Important correlations for the land cover variable number 4: 0"
[1] "Important correlations for the land cover variable number 5: 0"
[1] "Important correlations for the land cover variable number 6: 0"
[1] "Important correlations for the land cover variable number 7: 0"
[1] "Important correlations for the land cover variable number 8: 0"
[1] "Important correlations for the land cover variable number 9: 0"
[1] "Important correlations for the land cover variable number 10: 0"
[1] "Important correlations for the land cover variable number 11: 0"
[2] "Important correlations for the land cover variable number 11: 7"
[1] "Important correlations for the land cover variable number 12: 0"
[1] "Important correlations for the land cover variable number 13: 0"
[1] "Important correlations for the land cover variable number 14: 0"
[1] "Important correlations for the land cover variable number 15: 0"
[1] "Important correlations for the land cover variable number 16: 0"
[2] "Important correlations for the land cover variable number 16: 8"
[1] "Important correlations for the land cover variable number 17: 0"
[1] "Important correlations for the land cover variable number 18: 0"
[2] "Important correlations for the land cover variable number 18: 8"
Now by changing the method, we observe that we get that only the land cover variable 11 (shrubland) and the surface net thermal radiation are correlated. But the results are not as we expected, so we are going to regroup land covers that are similar.
First, to analyze the spread of a fire, we are going to look for the distribution of the number of fires in a grid during a month.
We remark that there most of the values next to 0 and because of the outliers we can’t see the distribution very good. So we are going to filter the outliers to have a better view of the distrubution
Now, since we want to predict the spread of a fire given the land covers data, let’s plot the distribution of the number of fire according to each land cover variable.
In this section, we try to model the fact that there was a fire in a grid for a given month given the land cover specification of the grid. The problem is that we have the number of fire in the grid. One can think of using the zero inflated Poisson regression. In fact, the zero inflated Poisson is used to count data with excess zeros and overdispersion, which describe well our data. It combines the Poisson distribution and the logit distribution.
Call:
zeroinfl(formula = CNT ~ lc1 + lc2 + lc3 + lc4 + lc5 + lc6 +
lc7 + lc8 + lc9 + lc10 + lc11 + lc12 + lc13 + lc14 + lc15 +
lc16 + lc17 + lc18, data = df)
Pearson residuals:
Min 1Q Median 3Q Max
-2.6002 -0.6556 -0.4760 -0.1269 164.8916
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4275 0.1150 -3.718 0.000201 ***
lc1 3.8936 0.1317 29.561 < 2e-16 ***
lc2 1.7546 0.1152 15.234 < 2e-16 ***
lc3 0.1621 0.1427 1.136 0.255880
lc4 2.5901 0.1170 22.144 < 2e-16 ***
lc5 2.3983 0.1148 20.897 < 2e-16 ***
lc6 -7.7769 0.4706 -16.526 < 2e-16 ***
lc7 2.6995 0.1149 23.490 < 2e-16 ***
lc8 2.3755 0.1158 20.514 < 2e-16 ***
lc9 1.0370 0.1172 8.845 < 2e-16 ***
lc10 4.5284 0.1232 36.762 < 2e-16 ***
lc11 1.5169 0.1153 13.158 < 2e-16 ***
lc12 1.9933 0.1155 17.256 < 2e-16 ***
lc13 -1.5373 0.1658 -9.271 < 2e-16 ***
lc14 2.7590 0.1212 22.764 < 2e-16 ***
lc15 2.5966 0.1282 20.260 < 2e-16 ***
lc16 5.8753 0.1176 49.973 < 2e-16 ***
lc17 -0.3846 0.1310 -2.937 0.003314 **
lc18 1.4079 0.1162 12.121 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.8587 0.4163 -23.68 < 2e-16 ***
lc1 9.8726 0.4567 21.62 < 2e-16 ***
lc2 11.9956 0.4168 28.78 < 2e-16 ***
lc3 12.7570 0.4947 25.79 < 2e-16 ***
lc4 9.9462 0.4243 23.44 < 2e-16 ***
lc5 10.0305 0.4162 24.10 < 2e-16 ***
lc6 18.0285 1.0028 17.98 < 2e-16 ***
lc7 8.0604 0.4164 19.36 < 2e-16 ***
lc8 9.4954 0.4201 22.60 < 2e-16 ***
lc9 11.0770 0.4201 26.37 < 2e-16 ***
lc10 1.2508 0.4685 2.67 0.00759 **
lc11 10.7596 0.4172 25.79 < 2e-16 ***
lc12 10.8828 0.4172 26.08 < 2e-16 ***
lc13 16.6698 0.5593 29.81 < 2e-16 ***
lc14 8.6888 0.4388 19.80 < 2e-16 ***
lc15 9.2563 0.4608 20.09 < 2e-16 ***
lc16 7.3972 0.4386 16.87 < 2e-16 ***
lc17 12.1465 0.4312 28.17 < 2e-16 ***
lc18 11.4647 0.4180 27.43 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 43
Log-likelihood: -1.138e+06 on 38 Df
In this section, we try to find some meteorological factors that could potentially trigger a wildfire. To do so, we will plot a Pearson correlation heatmap. The first step is to get a subset from our initial dataset that is more representative of wildfire conditions. As a fire is quite a rare event, we will not be able to find any links if we were considering the entire population (the correlations would be around 0).
To get our subset, we will only consider areas per month which have at least a certain number of wildfires and which have been burnt above a certain threshold. For our example, we take rows with more than 60 wildfires (\(CNT \geq 60\)) and 5000 acres of aggregated burnt area (\(BA \geq 5000\), approximately 20km2) in the current month.
We check whether the period in years of the subset matches that of the initial dataset, i.e. 1993 to 2015. We see that it does.
range(selection$year)
[1] 1993 2015
As a result, a correlation heatmap will allow us to distinguish certain factors that could favor a wildfire.
Since we are looking for meteorological risk factors, we may only consider the first two rows or columns of the heatmap.
First, we see that there are no pairs of fully correlated variables. Indeed, the highest value (absolute) is 0.43. The number of wildfires and months are negatively correlated. This can be explained by the fact that months range from March to September and that there are more occurrences of fire in March and April than in August and September. The fact that the numbers are small shows that the causes of a wildfire are meteorologically multifactorial.
An interesting observation is that \(CNT\) and \(BA\) are negatively correlated, meaning that the more area burnt, the fewer wildfires there are. Most of the time, the signs of correlations in the first two rows are not the same for these two variables. For example, by looking at temperatures and solar radiation, the higher they are, the more area burnt, but the less wildfires. This can be explained by the fact that our subset contains substantial wildfires, so there aren’t many, but they are destructive.
Another variable that we can comment on is precipitation. For an area to be burnt, there must be no rain or high humidity conditions, hence the negative correlations between \(BA\) and precipitation. The positive correlation with \(CNT\) could be explained by unstable weather conditions. As with wind speed, the higher it is, the more unstable the air masses and the greater the risk of a natural disaster.
But we should be careful with these correlations as we used a subset and as their values are not that high in absolute value.
Next, a map of the US represents the areas considered in the subset, regardless of months and years.
It can be seen that the areas represented are not that many in number and are located where the forest cover is the most important. For example, there are almost no red squares in the central region of the United States, as this area is mostly non-forest land.
For other thresholds, an interactive app is available here. Instructions and explanations are displayed on the app. Basically, you have the same results as above, but thresholds for \(CNT\) and \(BA\) are entered by the user.